Common mitochondrial deletions in RNA-Seq: evaluation of bulk, single-cell, and spatial transcriptomic datasets

Common mitochondrial DNA (mtDNA) deletions are large structural variants in the mitochondrial genome that accumulate in metabolically active tissues with age and have been investigated in various diseases. We applied the Splice-Break2 pipeline (designed for high-throughput quantification of mtDNA deletions) to human RNA-Seq datasets and describe the methodological considerations for evaluating common deletions in bulk, single-cell, and spatial transcriptomics datasets. A robust evaluation of 1570 samples from 14 RNA-Seq studies showed: (i) the abundance of some common deletions detected in PCR-amplified mtDNA correlates with levels observed in RNA-Seq data; (ii) RNA-Seq library preparation method has a strong effect on deletion detection; (iii) deletions had a significant, positive correlation with age in brain and muscle; (iv) deletions were enriched in cortical grey matter, specifically in layers 3 and 5; and (v) brain regions with dopaminergic neurons (i.e., substantia nigra, ventral tegmental area, and caudate nucleus) had remarkable enrichment of common mtDNA deletions.

Mitochondrial DNA (mtDNA) deletions are large structural variants in the mitochondrial genome where a large piece of DNA (often several kilobases or more) is missing [1][2][3][4][5][6] .These mtDNA molecules may or may not have the ability to replicate and can lead to metabolic impairment by disrupting regions that code for proteins, ribosomal and/or transfer RNAs essential for the oxidative phosphorylation (OXPHOS) pathway 3,[7][8][9] .The identification of mtDNA deletion breakpoints and the relative quantification of deleted vs. wild-type mitochondrial genomes have traditionally relied on several approaches, including Southern Blot analysis, Sanger sequencing, and quantitative polymerase chain reaction (qPCR) [10][11][12] .Recently, however, several bioinformatics programs and pipelines have been described to identify mtDNA deletions with high accuracy using next-generation sequencing (NGS) data [13][14][15][16][17] .Our group developed the Splice-Break pipeline for this purpose, and we previously demonstrated its accuracy and applications to human disease and aging research when used on Illumina sequencing data derived from NGS libraries prepared using mitochondrialenriched, long-range PCR products as the DNA input 13 .While PCR amplification of the mitochondrial genome in one or more large fragments is a common approach used by many mitochondrial research groups, it still represents a niche method, and such sequencing data is limited.With the more widespread availability of RNA-sequencing (RNA-Seq) data from a variety of tissues and phenotypes, however, it is of great interest to determine if these datasets can be utilized as a resource for mtDNA deletion investigations.
Large mtDNA deletions are often associated with short repeat sequences proximal to the 5′ and 3′ breakpoints; these repeats lead to an incorrect joining of two distal regions of the mitochondrial genome, either through strand-displacement during mtDNA replication or homologous end-joining during DNA repair processes [30][31][32][33] .Our previous study evaluating mtDNA deletions from brain and blood samples found that all the "Top 30" most frequent mtDNA deletions (which were detected in 69% or more of all samples) were associated with (perfect/imperfect) repeat sequences of 8-22 base pairs in length.These 30 deletions were previously validated by Sanger sequencing and include 12 breakpoints that had been previously described in patients with mitochondrial deletion disorders 13,34 .
MtDNA sequences present in RNA-Seq datasets are likely derived from real, transcribed RNA molecules, but may also be attributed to DNA contamination or "off-target" effects 26,45,46 .The amount of mitochondrial gene transcripts detected in RNA-Seq studies can also be highly variable due to overall RNA quality and library preparation procedures 26,[45][46][47][48][49][50][51] .Due to these concerns, many RNA-Seq and ATAC-Seq bioinformatics pipelines remove mitochondrial reads prior to genome alignment and/or transcript quantification 26,[45][46][47][48][49][50][51] .As such, our first objective was to determine which subset or species of mtDNA deletions demonstrated a significant and positive correlation between RNA-Seq and the aforementioned DNA sequencing processes that utilize long-range PCR amplification of the mitochondrial genome.We PCR-amplified mtDNA from the dorsolateral prefrontal cortex (DLPFC) and processed the NGS data through the traditional Splice-Break2 pipeline 13 , and compared the deletion read %'s to mRNA-Seq data that was previously published by Zeppillo et al. using the same tissue (GSE224683) 20 .Next, we evaluated 14 RNA-Seq studies [18][19][20][21][22][23][24][25][26][27][28][29] to determine how methodological differences in library preparation and sequencing depth affected our ability to detect mitochondrial reads and common mtDNA deletions.Lastly, taking our library preparation observations into account, we evaluated 11 of the 14 RNA-Seq datasets derived from human brain, muscle tissue, liver and blood, and examined if these common deletions significantly correlated with age [20][21][22][23]29 , were enriched in disease phenotypes or brain regions [18][19][20][21][22]24,25,29 , or had differential levels in cortical grey matter layers (I-VI) compared to white matter 26 .Cortical layer evaluations were performed using two spatial transcriptomics datasets, including one dataset of middle temporal gyrus (MTG) that is newly presented here in this study 26 (GSE226663).
Here, we describe the methodological considerations for applying the Splice-Break2 pipeline to RNA-Seq datasets, and present evidence that somatic aging effects, tissue specificity, and metabolic dysfunction can be studied with this approach.Given the wealth of both publicly available and restricted RNA-Seq datasets, this strategy may expedite investigations of common mtDNA deletions, especially for cases where the affected human tissue is not readily available for additional sequencing studies (e.g., rare diseases).Moreover, RNA-Seq studies are already designed to appreciate factors such as tissue and cellular specificity, environmental context, aging and drug exposure, whereas germline DNA studies (e.g., whole genome sequencing (WGS)) often use DNA derived from saliva or blood, which may not accurately recapitulate the somatic structural variation present in metabolically-active tissues or cells 52,53 .Taken together, we believe this study will open a new door to study common mtDNA deletions and their role in human health and disease.
The 30 samples derived from mitochondrial-enriched, long-range PCR products as the DNA input are available through dbGaP (phs002395.v1.p1) as part of a larger investigation of blood and brainderived mtDNA 13,54 .DNA sequencing was performed on mitochondrial enriched PCR-amplicons fragmented by sonication and prepared for NGS using a whole-genome library preparation kit 13,55 .Briefly, ~50 mg of frozen brain tissue was homogenized manually with a pestel; total DNA was extracted using the AllPrep DNA/RNA/Protein Kit (Qiagen, Valencia, CA, USA) and quantified using the Qubit Fluorometer and dsDNA BR Assay Kit (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's instructions.mtDNA was enriched for each sample using a long-range (LR) PCR.The forward primer used was: 5′-CCGCACAAGAGTGCTACTCTCCT C-3′; the reverse primer used was: 5′-GATATTGATTTCACGGA GGATGGTG-3′ (Integrated DNA Technologies, Coralville, IA, USA) 13,55 .Each sample was amplified in a 50 μl PCR reaction that contained the following: 50 ng of total DNA, 1 μl of each 10 uM primer, 8 μl of 2.5 mM dNTPs, 0.5 μl of LA Taq DNA Polymerase, Hot-Start Version (Takara Bio USA, Inc., Mountain View, CA, USA), and 5 μl 10x buffer.Thermocycler parameters were as follows: 94 °C for 1 min, followed by 30 cycles of denaturation at 98 °C for 10 s and annealing/extension at 68 °C for 15 min, with a final extension at 72 °C for 10 min.Reactions were then kept at 4 °C.Following PCR, 5 μl (10% of the total reaction volume) was loaded into a 1% agarose gel containing 10 mg/ml ethidium bromide and the gel was run at 100 V for approximately 2 h to confirm amplification of the full-length mitochondrial genome.PCR products were purified using Agencourt AMPure XP beads (Beckman Coulter, Indianapolis, IN, USA) and were quantified using the Qubit Fluorometer and dsDNA BR Assay Kit (Invitrogen, Carlsbad, CA, USA).DNA shearing was performed in S220 Covaris microTUBEs with the following settings: Duty Factor = 5%, Peak Incident Power = 175 W, Cycles per Burst = 200, Time = 35.200 ng of the sheared LR mitochondrial PCR product was used for library preparation using the TruSeq Nano DNA HT Library Preparation Kit (Illumina, San Diego, CA, USA).Each library was barcoded for multiplex sequencing with 96 samples per lane.NGS libraries were quantified and qualified prior to sequencing using the KAPA Library Quantification Kit (KAPA Biosystems, Wilmington, MA, USA) and Agilent Bioanalyzer DNA 7500 chips (Agilent, Santa Clara, CA, USA), respectively.Libraries were sequenced as 150-mer pairedend reads using the Illumina HiSeq 2500 at the UCI Genomics High Throughput Facility.
All accession numbers for the respective RNA-Seq datasets are provided in the Data Availability, Results section, in Supplementary Data 1 and 2, and are on GitHub (https://github.com/aomidsalar/RNA-Seq_Splice-Break2).

Internal data preparation: bulk RNA-Seq
The internal data obtained for methods comparisons (bulk RNA-Seq and spatial transcriptomics) utilized four subject's brain tissue obtained from the Banner Sun Health Research Institute (BSHRI) Brain and Body Donation Program.These four subjects consisted of one Parkinson's Disease (PD) subject (144113-144114), one Alzheimer's Disease (AD) subject (144111-144112), and two age-matched controls with no neurodegenerative disease (144105-144106 and 144107-144108).Briefly, frozen middle temporal gyrus (MTG) was microdissected into ~5 mm 3 blocks to be compatible with the grid size used for the 10x Visium Spatial Gene Expression platform.Each block was dissected to contain a sulcus, all six cortical layers of the grey matter, and white matter.Frozen tissue was affixed to cryostat chucks with OTC and sectioned with a Microm HM525 Cryostat at −20 °C to 10 μm slices.Frozen tissue slides were adhered to the 10x Visium Spatial Gene Expression or Tissue Optimization slides following the manufacturer's protocols.Additional sections were adhered to Selectfrost microscope slides for bulk RNA-Seq assays.All slides were frozen at −80 °C prior to library preparation.
For bulk RNA-Seq, two frozen slides from one control subject (144107/ 144108) were extracted for RNA using the Direct-Zol RNA mini prep kit.RNA was quantified and qualified using the Agilent Tape Station-HS RNA Screen Tape and Tape Station and had an RNA Integrity Number (RIN) of 5.6.240 ng of RNA was used as input for each bulk RNA-Seq prep, which utilized the NEBNext Ultra II Directional RNA library prep kit.The polyA (mRNA) library preparation used the NEBNext Poly(A) mRNA Magnetic Isolation Module, and the ribo depletion library preparation used the NEBNext rRNA Depletion Kit V2 and SBP beads.Libraries were barcoded with the NEBNext Multiplex Oligos for Illumina.Library size and quality was evaluated using the Agilent Tape Station-HS DNA Screen Tape and Tape Station; libraries were quantified with the KAPA qPCR library quantification kit.Both RNA-Seq libraries were sequenced as paired-end 100-mer reads on an Illumina NovaSeq 6000 system using a NovaSeq S1 flowcell.Sequencing was performed by the USC Keck Genomics Platform (KGP).
For spatial transcriptomics, tissue optimization (i.e., permeabilization time) was tested according to the 10x Genomics protocols, with 12 min of permeabilization selected for spatial transcriptomics sequencing.Two sections of each of the four subjects that were adhered to the 10x Visium Spatial Gene Expression were processed according to the 10x Genomics Visium protocols.Hematoxylin and eosin (H&E) imaging was performed at 20X magnification and auto-stitching using a Keyence BZ-X810 Microscope to generate high-resolution TIFF files.The cDNA was amplified using 12 cycles following Cq determination steps.Library size and quality was evaluated using the Agilent Tape Station-HS DNA Screen Tape and Tape Station; libraries were quantified with the KAPA qPCR library quantification kit.Spatial transcriptomics libraries were sequenced as paired-end 150mer reads on an Illumina NovaSeq 6000 system using a NovaSeq S4 flowcell.Sequencing was performed by the USC Keck Genomics Platform (KGP).
Downloading files FASTQ files available on GEO were obtained using the NCBI tool fastqdump, with command options to split files (--split-files) and append the read ID (-I).The LIBD spatial transcriptomics dataset 26 is available on Globus and Github; FASTQ and BAM files from this study were downloaded directly from the Globus endpoint jhpce#HumanPilot10x (http:// research.libd.org/globus/jhpce_HumanPilot10x/index.html);"filtered_fea-ture_matrix.h5" files and tissue images, which were used for clustering and image annotation, were downloaded from their Github repository (https:// github.com/LieberInstitute/HumanPilot).RNA-Seq FASTQ files from cerebellum (CER), hippocampus (HIPP), and prefrontal cortex (PFC) samples were downloaded directly from the Stanley Neuropathology Consortium website after requesting access to the data portal (http://sncid.stanleyresearch.org/).GTEx samples were obtained from the AnVIL Gen3 repository and downloaded as BAM after gaining access through dbGaP; they were sorted using samtools sort 56 (version 1.6) and converted to FASTQ using bedtools bamtofastq 57 (version 2.25.0).

HISAT2 alignment
In order to isolate reads that uniquely mapped to chrM and did not contain nuclear-mapped reads, we performed HISAT2 58 alignment to the nuclear genome on the RNA-Seq samples from the Zeppillo et al. study 20 .To do this, we first downloaded the Ensembl human genome reference FASTA file version GRCh38.103 59and removed the mitochondrial genome (chrM) from that file.Next, we built the index files for the human genome reference using the command "hisat2-build" 58 .We then performed alignment using HISAT2 58 default settings and the updated human genome reference file (without chrM), with the additional option "--un-conc" to store unmapped reads to separate FASTQ files.Those FASTQ files (containing reads that did not map to the nuclear genome) were then used as the input for Splice-Break2 analysis 13 .
Splice-Break2 analysis RNA-Seq FASTQ files were pushed through Splice-Break2 (https:// github.com/brookehjelm/Splice-Break2),using either the single-end or paired-end version based on the data format of the respective study.Default command line options described were used, apart from the prealignment steps which were skipped (i.e., command line options: --align=yes, --ref=rCRS, fastq_keep=no, --skip_preAlign=yes).For the spatial transcriptomics datasets, alignment (BCL to BAM file) was performed with 10x Genomics Space Ranger (https://support.10xgenomics.com/spatial-gene-expression/software/overview/welcome) and BAM to FASTQ conversion was performed using the 10x Genomics bamtofastq tool (https://support.10xgenomics.com/docs/bamtofastq).The FASTQ files for Read 2 were processed through the single-end version of Splice-Break2.Single-cell samples were pooled by tissue type (via FASTQ file concatenation) prior to being run through Splice-Break2.Initial read numbers for each study were obtained from the stats file (stats.txt);all other metrics used in this analysis regarding the mtDNA read % for the "Top 30" common deletions were obtained from the "<sample > _ LargeMTDeletions_DNAorRNA_Top30_NARpub.txt" files output from Splice-Break2.
Identification of FASTQ reads spanning the 6335-13999, 7816-14807, and 8471-13449 breakpoints We utilized the Unix command "grep" to identify reads that spanned the three deletion breakpoints described.For 6335-13999, we searched for a string containing the deletion breakpoint and ten bases upstream and downstream from that break (CTCCGTAGACCTAACCTGAC).For the 7816-14807 deletion, we searched for an 18 bp string (TCATC-GACCTCCCCACCC) first, followed by strings ATCCTAGT and GAAACTTCGG, which was necessary for specificity.For the 8471-13449 deletion, we searched for a string containing the deletion breakpoint and fifteen bases upstream and downstream from that break (CAAACTACCACCTACCTCCCTCACCATTGG).
To evaluate mtDNA deletions in cortical layers using spatial transcriptomics data that did not have stereological assessment, we conducted cortical layer imputation using the SpatialLIBD results as "ground truth" since that dataset included histology.First, we performed Seurat 60 clustering starting with "filtered_feature_matrix.h5" files on a pool of our eight USC spatial samples along with four SpatialLIBD samples (151673-151676) to generate 10 clusters, as this resulted in the most similar cortical layer architecture to the SpatialLIBD annotation.We performed further subclustering of cluster 3 (using the Seurat 60 command "FindSubCluster", to generate clusters 3_0 and 3_1), which gave us further distinction between cortical layers 2 and 3.The sample barcodes and their assigned clusters were then used to generate cluster-specific BAM files using the script "split_-spatial_bam_per_cluster.py"(provided in our GitHub repository).We compared the proportion of spots that appeared in the eleven Seurat clusters to the seven "ground truth" designations (white matter and cortical layers 1-6) to assign each Seurat cluster to its corresponding imputed layer.We assigned clusters that had over 40 percent of the spots overlapping with "ground truth" annotations to that given cortical layer and omitted clusters 8 and 9 which had an overall frequency of less than 5 percent.BAM files were converted to FASTQ files using the 10x Genomics bamtofastq tool (https:// support.10xgenomics.com/docs/bamtofastq); the cluster-specific FASTQ files were concatenated (when required) to make imputed cortical layer FASTQ files for each sample, which were then run through Splice-Break2.
To find spot barcodes that contained either the 6335-13999 deletion or 8471-13449 deletion, we used the Unix tool "grep" to search for the strings containing the deletion breakpoints (see Methods section "Identification of FASTQ Reads Spanning the 6335-13999, 7816-14807, and 8471-13449 Breakpoints").As a note, the 10x Genomics bamtofastq tool will output 3 types of files for each sample: (1) a "Read 1" file, which contains the UMI and barcode; (2) a "Read 2" file, which contains the cDNA sequence; and (3) an "Index" file, which contains the sample index.Reads that contained each of these deletions (from "grep" results of Read 2) were filtered and crossreferenced with matching headers in the Read 1 FASTQ to output their corresponding barcode sequences; barcodes were then stored in separate csv files (this was done using the script "grepfilestodeletionbarcodes.sh";https:// github.com/aomidsalar/RNA-Seq_Splice-Break2),and these barcodes were used for deletion analyses.Two proportion Z-tests were performed by calculating the proportion of spot barcodes in each imputed cortical layer that contained a "deletion spot"; spot counting and statistics were done manually in Microsoft Excel.
Statistics and reproducibility.Statistical analyses were performed in both R and Microsoft Excel.Spearman's and Pearson correlation tests were done in R using the command "cor.test".Linear regression calculations were made using the "lm" command and ANOVA test were done using the command "aov"; t-tests were done using the "t.test" command in RStudio or "TTEST" formula in Microsoft Excel.Equality of variances was testing using the FTEST formula in Microsoft Excel.Two-proportion Z-tests were calculated manually in Microsoft Excel.All statistical tests where four deletion metrics were evaluated include multiple comparisons corrections using the Bonferroni method; this was done using the command "p.adjust" with option "method = "bonferroni" in R.
All sample sizes for the RNA-Seq studies are shown in the figure legends and in Tables 1 and 2. All replicates analyzed were from different subjects, apart from GSE140089 that included PD subjects pre-and postexercise.All deletion metrics for GEO+ datasets are shown in Supplementary Data 2.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results
Common mtDNA deletions in DNA vs RNA sequencing We used a paired dataset of 30 postmortem brain samples that underwent both DNA and RNA-Sequencing 20 to explore Splice-Break2's utility in analyzing common mtDNA deletions from RNA-Seq data (Fig. 1).These samples came from DLPFC of patients with schizophrenia (SCZ) and healthy controls (CTRL).MtDNA-enriched DNA sequencing was performed as previously described 13,55 .This DNA dataset had an average of 3.47 million reads per sample, with 90.31 ± 5.2% alignment to the revised Cambridge Reference Sequence (rCRS) version of the human mitochondrial genome (NC_012920.1) 64.This resulted in an average MT Benchmark coverage (i.e., mitochondrial depth) of 23,466x.MT benchmark coverage is a measure of the average sequencing depth of the sample, measured from two 250 bp segments within the RNR1 and CYB genes 13 .RNA-Sequencing was performed as described in Zeppillo et al. 20 .This RNA-Seq dataset had an average of 114.5 million reads per sample, with 16.17±5.9%alignment to the mitochondrial genome (NC_012920.1) 64.Spearman and Pearson's correlations between the Splice-Break2 results from the DNA and RNA datasets were performed for three of the "Top 30" most common mtDNA deletions, as well as the summation (cumulative read %) of all these 30 deletions together (Fig. 1).We also evaluated the summation of additional deletions of high frequency (n = 112), but those cumulative read %'s did not have a significant correlation between DNA and RNA (Supplementary Table 1), so we focused exclusively on the "Top 30" deletions for this investigation.The deletion read %'s for the "Top 30" deletions and the cumulative deletion read rate for every sample and this paired RNA/DNA dataset is provided in Supplementary Data 1.
The cumulative deletion read % of these "Top 30" had a significant and positive correlation between the DNA and RNA-Seq datasets after each was normalized to the amount of mitochondrial data (i.e., MT benchmark coverage) (Spearman's rho.0.567, p = 4.33e-3; Pearson's cor.0.454, p = 0.0468) (Fig. 1a).Not surprisingly, the DNA dataset had a significantly higher cumulative deletion read % than the RNA-Seq dataset (p = 1.09e-10), with 124-fold higher levels in the DNA (Fig. 1a).Three individual deletions were also examined for DNA/RNA correlations: the 6335-13999 deletion, which was the most frequently detected deletion in our previous study and was present in 98.9% of samples there, the 7816-14807 deletion, which was the second most frequently detected deletion and was present in 92.5% of samples there, and the 8471-13449 deletion (known as "The Common Deletion"), which was the third most frequently detected deletion in our previous study and was present in 91.4% of samples there 13 .The 6335-13999 deletion did not have statistically significant correlations (Spearman's rho 0.409, p = 0.0993; Pearson's cor.0.235, p = 0.845) after multiple comparisons corrections (Fig. 1b).The DNA dataset also had a significantly higher deletion read % for this deletion than the RNA-Seq dataset (p = 1.35e-8), with 512-fold higher levels in the DNA (Fig. 1b).The 7816-14807 deletion did not have a significant correlation between the DNA and RNA-Seq datasets when evaluating the ranks (Spearman's rho 0.342, p = 0.256); however, the correlation of read values was significant (Pearson's cor.0.466, p = 0.0378) (Fig. 1c).The DNA dataset also had a significantly higher deletion read % for this deletion than the RNA-Seq  20 was processed through the Splice-Break2 pipeline and compared to results using the traditional mtDNA-enrichment and DNA sequencing approach 13,55 .All analyses included correlations and relative abundance of deletion reads detected in each sample (not normalized) and the deletion read rate for each sample (normalized).a The sum of the Top 30 deletions.dataset (p = 7.65e-6), with 196-fold higher levels in the DNA (Fig. 1c).The 8471-13449 deletion had a significant and positive correlation between the DNA and RNA-Seq datasets (Spearman's rho 0.740, p = 1.21e-5;Pearson's cor.0.672, p = 1.88e-4) (Fig. 1d).Again, the DNA dataset had a significantly higher deletion read % than the RNA-Seq dataset (p = 5.54e-7), with 22-fold higher levels in the DNA (Fig. 1d).DNA consistently had a higher deletion read % than the RNA-Seq dataset for each of the "Top 30" deletions, and a higher percentage of the "Top 30" deletions were captured in the DNA data (Supplementary Fig. 1).
The "Top 30" mtDNA deletions breakpoints from Splice-Break2, along with their adjusted positions (aligned to the first base based on the repeat sequences), and the mitochondrial genes that encompass these 5′ and 3′ breakpoints are shown (Fig. 2a).Confirmation and visualization of the three mtDNA deletions analyzed was performed using "grep" and the ggmsa R package, respectively 63 .Reads that contained the sequence CTCCGTA-GACCTAACCTGAC, a 20-bp string that encompasses the 6335-13999 breakpoint, aligned as expected to the COX1 and ND5 genes, with retainment of 1 copy of the repeat sequence (Fig. 2b).Reads that contained the sequence TCATCGACCTCCCCACCC an 18-bp string that encompasses the 7816-14807 breakpoint (and that also contained the sequences ATCCTAGT and GAAACTTCGG), aligned as expected to the COX1 and CYB genes, with retainment of 1 copy of the repeat sequence (Fig. 2c).Reads that contained the sequence CAAACTACCACCTACCTCCCTCAC-CATTGG, a 30-bp string that encompasses the 8471-13449 breakpoint, aligned as expected to the ATP8 and ND5 genes, with retainment of 1 copy of the repeat sequence (Fig. 2d).
To determine if nuclear-mitochondrial DNA segments (NUMTs) were contributing to our detection of common mtDNA deletions, and to assess if it was necessary to process RNA-Seq data with these sequences removed, we compared the mtDNA deletion results from all reads to only those that uniquely mapped to chrM (i.e., after HiSat2 58 alignment to the nuclear genome and downstream processing of unmapped reads) (Supplementary Fig. 2).The correlations between the mtDNA deletion reads detected (before normalization) was very high for both the "Top 30" sum and the three individual deletions evaluated when comparing all reads to only uniquely mapped reads (Spearman's rho range: 0.81 to 0.999, p-value range: 2.40e-7 to 2.72e-42; Pearson's cor range: 0.957-0.998,p-value range: 2.06e-20 to 5.10e-39).There was no significant difference in either the number of deletion reads detected or the deletion read % (after normalization) for any deletion metric (Supplementary Fig. 2).Additional analysis of the uniquely mapped reads from RNA-Seq data compared to the DNA dataset demonstrated that utilizing uniquely mapped reads did not significantly improve the correlations between RNA and DNA (Supplementary Fig. 3); thus, we concluded removal of nuclear mapped reads was not necessary and we processed all RNA-Seq reads through the Splice-Break2 pipeline for the remainder of this study to streamline our workflow.However, both approaches (i.e., using all reads or only chrM uniquely mapped reads) are described in our "RNA-Seq Best Practices for Splice-Break2" document on GitHub.Total reads, mitochondrial data and common mtDNA deletions across 14 RNA-Seq studies We processed 14 human RNA-Seq studies [18][19][20][21][22][23][24][25][26][27][28][29] through the Splice-Break2 pipeline and evaluated the effects of library preparation method and sequencing depth (i.e., total read numbers and MT benchmark coverage) on our ability to capture common mtDNA deletions (Fig. 3).Our analysis included 12 previously published studies and 2 newly presented here.The 12 previously published studies we included are as follows (see Tables 1 and 2 Fig. 4 | Analyses of age in GEO+ samples.Correlations between the Top 30 cumulative deletion, 6335-13999, 7816-14807, and 8471-13449 deletion read rates with age in brain 20,21 and skeletal muscle 22,23   of 357 single cells (neurons, fetal quiescent, astrocytes, endothelial, oligodendrocyte precursor cells, and microglia cell types) isolated from fresh brain tissue 28 ; and (12) Lonsdale et al. (i.e., GTEx) study, evaluating multiple brain regions, skeletal muscle, liver and blood 29 .Single-cell samples were pooled by cell type prior to Splice-Break2 analysis, resulting in 7 pools of pancreas cell types and 6 pools of brain cell types (Table 1).The two newly presented studies include one bulk sequencing dataset of postmortem MTG brain tissue with RNA libraries prepared both with (n = 1) and without (n = 1) ribosomal depletion, as well as one spatial transcriptomics dataset of MTG brain tissue from patients with PD (n = 2), AD (n = 2), and CTRL (n = 4) (GSE226663).One of our internal control samples was sequenced using all three preparation methods; this helped us compare the effect of library preparation method on our ability to capture mtDNA deletions without additional effects of tissue type or sequencing site.In total, we examined 1570 samples ranging in age from prenatal to 96 years across the 14 datasets.
We processed all 1570 samples (Fig. 3a-p), including the one internal (MTG) sample that had been processed three ways (Fig. 3a-h), through the Splice-Break2 pipeline and compared the following key metrics to determine which variables influenced common mtDNA deletion levels: total RNA-Seq reads, mitochondrial (MT) benchmark coverage, MT benchmark coverage per million reads, and the "Top 30" cumulative deletion read    numbers.Total RNA-Seq reads are the initial reads from each sample prior to alignment.MT benchmark coverage is a measure of the average sequencing depth of the sample, measured from two 250 bp segments within the RNR1 and CYB genes 13 .The "Top 30" cumulative deletion read amount is the sum of the "Top 30" reads.The deletion read rates for the internal samples processed three ways are also shown (Fig. 3e-h), which are calculated by normalizing the number of reads for each mtDNA deletion to the MT benchmark coverage, and is the metric used to evaluate the effect of age, tissue, and diagnoses (see Figs. 4-7).The deletion read %'s for the "Top 30" deletions and the cumulative deletion read rate for every sample and all RNA-Seq studies is provided in Supplementary Data 2 for the GEO+ datasets; individual-level data is not shown for the GTEx dataset because of dbGaP restrictions.
We observed that samples that used a ribosomal depletion library preparation method (but had similar abundance of sequencing reads prior to alignment) had less MT benchmark coverage than polyA (nonribosomal depleted) samples (Fig. 3a-c, i-k).Due to a reduced amount of mitochondrial data in the samples subjected to ribosomal depletion, we observed little to no common mtDNA deletions in these samples; conversely, for samples prepared without ribosomal depletion (i.e., bulk polyA or other), we consistently observed these common mtDNA deletions and more mitochondrial data overall (Fig. 3).Samples prepared by LCM showed the highest number of deletion reads (Fig. 3l); however, it is important to note that these LCM samples were all derived from aged human brain regions with high dopaminergic neuron innervation, the SN and VTA, so these high values may not solely reflect the sample preparation method.Evaluations of our internal control samples prepped by three methods further confirmed our observation of higher abundance of mitochondrial reads and deletions in polyA (nonribosomal depleted) preparations and showed more MT data was captured by spatial transcriptomics than by bulk RNA-Seq (Fig. 3b-h).This is in alignment with previous observations that this spatial transcriptomics method (i.e., 10x Genomics Visium) may capture more "offtarget" and/or non-polyadenylated reads compared to other methods 26,66 .Single-cell RNA sequencing (scRNA-Seq) samples did not contain the three mtDNA deletions followed in this study, so these samples were not evaluated further (Table 1, Fig. 3l).Taken together, we found that bulk sequencing with polyA (non-ribosomal depletion), spatial transcriptomics, and LCM methods were amenable to mtDNA deletion investigations, while scRNA-Seq and ribosomal depletion methods were generally inutile due to insufficient mitochondrial data.
Common mtDNA deletions and the effects of age and sex MtDNA mutations and deletions have been positively associated with aging in various tissues such as the brain, skeletal muscle, and heart 4,5,43,[67][68][69][70] .One of the most investigated large mtDNA deletions is the 8471-13449 "common deletion" (also referred to as the 4977 bp deletion), which was found to increase with age in the heart and brain of healthy adult patients 43,69,70 .In contrast with the nuclear genome, the mitochondrial genome is particularly vulnerable to DNA damage because it is located in close proximity to damaging free radicals and reactive oxygen species generated through the OXPHOS pathway, does not have the protection of histones, and has less efficient DNA repair machinery [71][72][73] .Over time, these factors can lead to the accumulation of mtDNA deletions within cells 70,71,73 .
Among the GEO+ datasets, we chose four studies for age analysis because of the wide age distributions of their samples [20][21][22][23] .In the DLPFC study 20 , we observed significant positive correlations between age and the "Top 30" cumulative deletion read rate (p = 0.0119) and the 8471-13449 "common deletion" (p = 0.0159) (Fig. 4a).In the SM study from PD and CTRL samples 22 , we similarly observed significant positive correlations between age and the "Top 30" cumulative deletion read rate (p = 0.0113) as well the 8471-13449 "common deletion" (p = 0.021; Fig. 4b).In the SM study of CTRL samples 23 , we did not observe a significant correlation between age and any of the four mtDNA deletion metrics evaluated (Fig. 4c); only ~1/2 of the samples were included in the aging analysis because we observed a significant "batch effect" in this study (Supplementary Fig. 4).In the Stanley Neuropathology Consortium dataset 21 , which contained CER, HIPP, and PFC, we observed significant positive correlations between age and the "Top 30" cumulative deletion read rate (p = 0.0131) and the 8471-13449 "common deletion" (p = 0.0193) in the PFC (Fig. 4f); these deletion metrics were also observed to increase with age in the CER and HIPP, but the correlations with age were not significant after multiple-comparisons corrections (Fig. 4d, e).We did not observe a significant correlation with age for the 6335-13999 or 7816-14807 deletion in any of these datasets (Fig. 4).In addition, none of these datasets exhibited significant differences between biological sexes for any of the four deletion read rate metrics (Supplementary Table 2).
Age analysis for the GTEx RNA-Seq data was also performed on a subset of available samples, including 183 paired samples of cerebellum (CER) and cortex (CORT), 41 paired samples from multiple brain regions (i.e., amygdala (AM), anterior cingulate cortex (ACC), caudate nucleus (CAUD), frontal cortex (FC), hippocampus (HIPP), and substantia nigra (SN)), and 165 paired samples from non-brain regions (i.e., blood, liver and skeletal muscle (SM)) (Fig. 5 and Supplementary Fig. 5).In the paired CER and CORT dataset, we observed significant positive correlations between age and the "Top 30" cumulative deletion read rate (p = 0.00238) and the 8471-13449 "common deletion" (p = 5.88e-4) in the CORT, and also a significant positive correlation between age and the "Top 30" cumulative deletion read rate (p = 0.0342) in the CER (Fig. 5).The other common deletions evaluated were not significant for these brain regions after multiple comparisons corrections (Fig. 5 and Supplementary Fig. 5).In the paired analysis of 6 brain regions, we observed significant positive correlations between age and the mtDNA deletion metrics for the AM, CAUD and SN; we did not observe significant correlations with age in the ACC, FC or HIPP after multiple comparisons corrections (Fig. 5 and Supplementary Fig. 5).Significant age correlations in the AM included the 8471-13449 deletion (p = 0.0172) and the "Top 30" cumulative deletion read rate (p = 0.0351).Significant age correlations in the CAUD included the 6335-13999 deletion (p = 1.94e-3), the 7816-14807 deletion (p = 0.0358), the 8471-13449 deletion (p = 1.11e-5), and the "Top 30" cumulative deletion read rate (p = 1.20e-4).Significant age correlations in the SN included the 8471-13449 deletion (p = 0.00896), and the "Top 30" cumulative deletion read rate (p = 0.0182).Lastly, in the paired analysis of blood, liver and SM, we only observed significant and positive correlations between age and the mtDNA deletion metrics for SM, specifically for the 8471-13449 deletion (p = 0.0210), and the "Top 30" cumulative deletion read rate (p = 0.0210) (Fig. 5).Overall, we were able to recapitulate previously published findings that common mtDNA deletions increase with age in the brain and muscle, and we provide evidence from paired datasets that different brain regions and tissues have variable susceptibility to this increase, some of which, to our knowledge, is illustrated for the first time.
Common mtDNA deletions across brain regions, tissues, and the effect of diagnosis Evaluation of the MTG, AM, and SN dataset 18 of PD and CTRL subjects revealed no significant differences in "Top 30" cumulative deletion read % in the SN compared to the MTG or AM for any of the deletion metrics, after multiple comparisons corrections (Fig. 6a).However, the SN did display higher levels and it should be emphasized that this study used a ribosomal depletion method for library preparation, so it is perhaps not surprising we were unable to detect a significant difference since so little mitochondrial data is captured with that RNA-Seq method.The SN and VTA LCM dataset 24 from PD and CTRL samples (Fig. 6b) and the dorsal and ventral SN LCM dataset 25 of CTRL subjects (Fig. 6c) showed no significant differences in any of the four mtDNA deletion metrics we evaluated.The Stanley Neuropathology Consortium dataset 21 of CER, PFC, and HIPP tissue showed significantly higher deletion read %'s in the HIPP compared to CER and PFC for the 7816-14807 deletion (p ≤ 0.01324) and the "Top 30" cumulative deletion read rate (p ≤ 3.12e-12; Fig. 6d).The HIPP had significantly higher deletion read %'s than the CER for the 6335-13999 deletion (p = 0.00636) and the 8471-13449 deletion (p = 2.92e-5) (Fig. 6d).The 8471-13449 deletion were also significantly higher in the PFC than in the CER (p = 0.0261) (Fig. 6d).
In general, we did not detect significant differences based on diagnosis in any of the GEO+ datasets we evaluated.The one exception to this was the SN and VTA LCM dataset 24 from PD and CTRL samples had significantly higher levels of the 7816-14807 deletion in the CTRL subjects in both SN (p = 0.0109) and VTA (p = 0.0064) (Fig. 6f and Supplementary Fig. 6).It should be noted that all of these studies had a small sample size per diagnostic group (range: n = 5-17).

Common mtDNA deletions across imputed cortical layers in spatial transcriptomics data
To investigate the common mtDNA deletion metrics across (imputed) cortical layers, we used 12 spatial transcriptomics samples from two separate human brain datasets.The two spatial transcriptomics datasets include the Maynard et al. "SpatialLIBD" dataset 26 , from which we included 4x section replicates of DLPFC tissue from a single healthy control subject (sample numbers 151673-151676), and a new internal dataset of 8x spatial transcriptomics samples of MTG tissue taken from duplicate sections of 4 patients (diagnosed as AD, PD, or CTRL).Both datasets were prepared according to 10x Genomics's Visium spatial transcriptomics library preparation and sequencing protocols.The SpatialLIBD samples were included to assist in cortical layer imputation since they contained "ground truth" measures of cortical layer geography based on stereology/imaging analysis 26 .We performed Seurat 60 clustering on all 12 spatial samples from both studies and imputed cortical layers based on the Seurat clusters that had the highest overlap with the SpatialLIBD ground truth annotations; this resulted in the following imputed cortical layers: "White Matter", "Layer 1", "Layer 2 + 3", "Layer 3", "Layer 5", and "Layer 6" (Fig. 8a-c).
To determine the proportion of spots in each cortical layer that contained the 6335-13999 and/or 8471-13449 mtDNA deletion, we needed the spot barcode data (which is not output as part of the Splice-Break2 process).As such, we used "grep" to identify reads that spanned these breakpoints, as described previously, and then determined which cortical layer these reads mapped to using the spot barcode.The percentage of spots that contained the 6335-13999 deletion (Fig. 8d) or the 8471-13449 deletion (Fig. 8e) for all 12 samples was analyzed for each imputed cortical layer.We observed that the 6335-13999 deletion had a significantly higher percentage of spots (twoproportion Z-test) in imputed cortical layers 3 and 5 compared to the white matter (p ≤ 0.0066; Fig. 8d).The 8471-13449 "common deletion" also had significantly higher geographical representation in cortical layers 3 and 5 compared to white matter (p ≤ 3.30e-5; Fig. 8e).These spots containing deletions were also annotated on tissue images for visualization (Supplementary Fig. 8).Further, we observed similar trends in the percentage of spots with the 6335-13999 or 8471-13449 deletion when analyzing just the USC internal samples (Supplementary Fig. 9).These 8x spatial samples were split according to their imputed cortical layer (i.e., using "cluster BAMS" from Seurat 60 ), and each imputed layer was run through Splice-Break2; we found a significant difference between white and grey matter for the 8471-13449 "common deletion" (p = 0.0393; Supplementary Fig. 9).Taken together, these results suggest cortical layers have different levels of common mtDNA deletions in the aged human brain and are most abundant in layers 3 and 5 where there is an enrichment of pyramidal neurons 26,74,75 .

Discussion
The aim of this study was to determine whether RNA-Seq data can be used to evaluate common mtDNA deletion levels in somatic tissues using the bioinformatics tool Splice-Break2.Initial analyses were done on a paired dataset where there was DNA and RNA-Seq data available; we observed robust differences in DNA and RNA deletion read %, with the DNA data containing at least 22-fold higher deletion read rates.We hypothesize that the decreased rate of these common deletions in RNA-Seq data may be due to transcript abundance or stability-that mitochondrial genomes with large deletions may transcribe polycistronic transcripts less efficiently than wildtype molecules, or that RNA transcripts containing deletion breakpoints may be less stable, or both.It is worth mentioning that polyadenylation of mitochondrial transcripts serves a different role than it does for nuclear-encoded genes where 3′ polyadenylation helps stabilize mRNA molecules and promotes protein translation; this also occurs for human mitochondrial-encoded genes, but transcripts can also be polyadenylated at many intragenic positions and are subject to polyadenylation-dependent RNA degradation mechanisms that are conserved features of bacteria (the evolutionary origin of mitochondria) [76][77][78] .In addition, the deletion read %'s in the DNA data are likely increased due to the long-range PCR amplification used for mitochondrial enrichment and the smaller size of the deleted genomes.
The mtDNA deletion metrics that were examined in this study were the cumulative read % of the "Top 30" most frequent deletions that we identified and Sanger-validated in our previous study, and the individual read % of three specific mtDNA deletions: 6335-13999 (adjusted position 6341-14005), 7816-14807 (adjusted position 7814-14805), and 8471-13449 (the "common deletion"; adjusted position 8482-13460).The cumulative read % of the "Top 30" deletions and the 8471-13449 deletion had statistically significant positive correlations between the DNA and RNA-Seq paired dataset examined, with the "common deletion" having the strongest correlations between methods.Future studies may include additional paired datasets (DNA, RNA-Seq, and other genomics methods) and analyses of less common mitochondrial deletions.Analysis of patients with large clonal deletions will also be of interest.The Splice-Break2 pipeline does output additional breakpoints besides the "Top 30" most common deletions, but those metrics should be used with caution for RNA-Seq data and may require further validation (such as qPCR and/or Sanger sequencing).
All Splice-Break2 data is reported as "deletion read %", which is the number of deletion reads detected divided by the MT Benchmark coverage; however, this should not be interpreted as an absolute estimation of heteroplasmy rate (for DNA or RNA-Seq) because of inherent biases of these NGS methods.In our original methods article, we did observe a significant positive correlation between the Splice-Break2 deletion read % and qPCR quantification of the 8471-13449 "common deletion", and here we show the deletion read %'s can correlate between DNA and RNA-Seq datasets-thus, we conclude that although absolute heteroplasmy rates cannot be inferred, the fold differences between samples is retained and allows us to use deletion read % for statistical analyses of age, tissue type and diagnoses.Future studies that pair mitochondrial/metabolic assays with NGS measurements will be important to understand the physiological relevance of these deletions and at what threshold deletion read %'s impact cellular function.
Splice-Break2 was used to examine various library preparation methods, such as bulk RNA-Seq, LCM RNA-Seq, spatial transcriptomics, and scRNA-Sequencing.We observed that bulk sequencing without ribosomal depletion allowed for more consistent capturing of mtDNA transcripts and deletions compared to bulk sequencing with ribosomal depletion.Additionally, the scRNA-Seq studies we evaluated had minimal to no mtDNA deletion capture despite containing comparable amounts of mtDNA transcripts (MT Benchmark coverage) compared to the other studies evaluated.Analysis of additional scRNA-Seq datasets and methods, including LCM, along with other non-RNA single-cell genomics methods (such as scATAC-Seq), is warranted in the future.The LCM RNA-Seq samples we evaluated contained the highest deletion read rates; however, it is important to note that these samples originated from aged brain tissue from regions innervated with dopaminergic neurons (SN and VTA), which have a demonstrated significant increase in mtDNA deletion burden 79,80 .Therefore, we cannot determine from these LCM datasets if the increased mtDNA deletion levels are strictly due to brain region, library preparation method, or are a combination of both effects.Overall, these data indicate that the most valuable RNA-Seq wet lab protocols for mtDNA deletion detection include bulk sequencing without ribosomal depletion (e.g., polyA), LCM RNA-Seq, and spatial transcriptomics.
Our analysis of aging, differences among brain regions, and diagnostic effects revealed trends that were consistent with the effects of mitochondrial dysfunction reported in current literature [2][3][4][5]9,79,80 . It ha been reported in several studies that mtDNA deletions accumulate in some tissues as they age, and studies on the "hallmark" mtDNA deletion disorders (e.g., KSS, Pearson's Syndrome) have proved these deletions can have functional (and even lethal) consequences if at a high enough threshold to affect cellular function 36,37 .The age-dependent increase in mtDNA deletions has been observed in various somatic tissues including the brain; in the substantia nigra, high mtDNA deletions have been linked to age-related and diseaserelated (i.e., PD) neuronal loss 43,67,79,80 .We observed statistically significant increases in the mtDNA deletion metrics due to age in brain and/or muscle datasets.Our most robust evaluation of aging was performed using paired RNA-Seq data from GTEx.Overall, we were able to recapitulate previously published findings that common mtDNA deletions increase with age in the brain and muscle, and show that different brain regions and tissues have variable susceptibility to this age-dependent increase.
We were able to observe some interesting differences in tissues and brain regions in the paired GTEx dataset, where common mitochondrial deletions were lowest in blood, lower in liver than skeletal muscle, higher in cortex than in cerebellum, and high (but variable) across cortical brain regions.In both GEO+ and GTEx datasets, we observed increased mtDNA deletions in brain regions containing dopaminergic neurons (SN, VTA and caudate nucleus); however, we did not detect an increase in PD.This is consistent with previous reports that have found an increase in SN tissue but no significant increase in PD specifically 79,[81][82][83] .It would be interesting to investigate a larger cohort of patient samples taken at various stages of PD and/or compare these results neuropathological measures of dopaminergic cell death (e.g., SN depigmentation scores), as this data suggests the age effects in PD SN may not be linear.Measures of cell type abundance (e.g., histology, in situ gene expression, cell counts from flow cytometry or CBC tests, or single-cell RNA-Seq data) from paired tissue (preferably the same dissection/sample) maybe also improve analyses of age and diagnosis if that data can be obtained and used as a co-variate.
Our common mtDNA deletion analysis of spatial transcriptomics data, specifically across (imputed) cortical layers of the DLPFC and MTG, revealed increased deletion burden in grey matter compared to white matter, which is not surprising given its high metabolic activity 74,84,85 .Cortical layers 3 and 5 consistently contained the highest percentage of "spot barcodes" with mtDNA deletions.This supports the hypothesis that brain regions and cortical layers are differentially susceptible to mtDNA damage 69,79,80,86 .Future studies investigating the effect of mtDNA damage in neurodevelopment, psychiatric disorders or neurodegenerative diseases may want to focus on spatial transcriptomic approaches so that analyses can focus on specific cortical layers and increase the signal-to-noise ratio for these tests.
In summary, this robust analysis of multiple, highly used RNA-Seq methods demonstrated the utility of detecting mtDNA deletions of high frequency through our bioinformatics pipeline.The Splice-Break2 tool was effective in quantifying common mtDNA deletions in polyA (non-ribosomal depleted) bulk sequencing, LCM, and spatial transcriptomic datasets, and these datasets may also be amenable to other bioinformatics approaches of mtDNA deletion quantification.Of note, the current version of Splice-Break2 is only compatible with human data, as the alignment and annotations utilize the rCRS 64 , our catalogue of human mtDNA deletion breakpoints 13 , and MitoMap 87 , respectively.With the wide breadth of publicly available and restricted human RNA-Seq datasets that encompass a variety of tissues, diseases, and experimental/environmental conditions, the ability to incorporate mtDNA deletion metrics into these investigations may provide information about metabolic effects in tissues and their contributions to disease phenotypes and aging.

Data availability
Our analysis included 12 previously published studies and 2 newly presented here.The 12 previously published studies we included in this paper can be accessed from the following: (1) Simchovitz et al. 18 study is deposited on GEO with accession code "GSE114517"; (2) Nativio et al. 19 study is deposited on GEO with accession code "GSE159699"; (3) Zeppillo et al. 20 study is deposited on GEO with accession code "GSE224683"; (4) Kim et al. study 21,65 (i.e., The Stanley Neuropathology Consortium) is available on their website (http:// sncid.stanleyresearch.org);5) Lavin et al. study 22 is deposited on GEO with accession code "GSE140089"; (6) Tumasian et al. study 23 is deposited on GEO with accession code "GSE164471"; (7) Aguila et al. study 24 is deposited on GEO with accession code "GSE114918"; (8) Monzón-Sandoval et al. study 25 is deposited on GEO with accession code "GSE166024"; (9) Maynard et al. study 26 is available on their GitHub page (https://github.com/LieberInstitute/HumanPilot) and Globus endpoint "jhpce#HumanPilot10x" (http:// research.libd.org/globus/jhpce_HumanPilot10x/index.html);(10) Enge et al. study 27 is deposited on GEO with accession code "GSE81547"; (11) Darmanis et al. study 28 is deposited on GEO with accession code "GSE67835"; and (12) Lonsdale et al. (i.e., GTEx) study 29 is available through their portal (https://gtexportal.org/home/) 29 .All RNA-Seq samples newly presented in this study have been deposited on GEO with primary accession code "GSE226663".Source data underlying Fig. 1 and Fig. 3a-h can be found in Supplementary Data 1.Deletion metrics for all GEO+ studies (used for Figs. 3, 4 and 6) can be found in Supplementary Data 2. Source data underlying Fig. 8d can be found in Supplementary Data 3. Raw data is not provided for the GTEx datasets due to restricted access.Arizona Parkinson's Disease Consortium) and the Michael J. Fox Foundation for Parkinson's Research.The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS.Mitochondrial DNA data used for this analysis are available in: dbGaP accession number phs002395.v1.p1.The GTEx data used for the analyses described in this manuscript were obtained from: dbGaP accession number phs000424.vN.pN on 07/17/2023.We would like to thank the laboratory of Leonardo Morsut (USC) for the use of their microscope.We would also like to thank the authors of the various human RNA-Seq studies analyzed here for making their data publicly available to the research community.In addition, we are always greatly appreciative to all the patients and tissue donors, as well as their families, without whom this work would not be possible.

Fig. 1 |
Fig.1| Correlations and comparisons of mtDNA deletions captured by DNA vs RNA-Seq.RNA-Seq data from 30 brain samples20 was processed through the Splice-Break2 pipeline and compared to results using the traditional mtDNA-enrichment and DNA sequencing approach13,55 .All analyses included correlations and relative abundance of deletion reads detected in each sample (not normalized) and the deletion read rate for each sample (normalized).a The sum of the Top 30 deletions.The three most common deletions: (b) 6335-13999, (c)7816-14807, and (d) Fig.1| Correlations and comparisons of mtDNA deletions captured by DNA vs RNA-Seq.RNA-Seq data from 30 brain samples20 was processed through the Splice-Break2 pipeline and compared to results using the traditional mtDNA-enrichment and DNA sequencing approach13,55 .All analyses included correlations and relative abundance of deletion reads detected in each sample (not normalized) and the deletion read rate for each sample (normalized).a The sum of the Top 30 deletions.The three most common deletions: (b) 6335-13999, (c) 7816-14807, and (d) 8471-13449.Spearman and Pearson's correlations are shown.Statistical values for box plots are from Welch's t-tests.All p-values were corrected for multiple tests using Bonferroni.All boxplots show the median as a solid black line; the first and third quartiles are captured by the bounds of the box.Boxplot whiskers are defined as the first and third quartiles ± interquartile range times 1.5, respectively, and outliers are denoted as points.

Fig. 2 |
Fig. 2 | MtDNA deletions captured by RNA-Seq.a The Top 30 most frequent mtDNA deletions evaluated in this study and previously described 13 .Multiple sequence alignment (MSA) plot of RNA-Seq reads from a dataset of 30 brain

Fig. 5 |
Fig.5| Analyses of age in GTEx samples.Correlations between the Top 30 cumulative deletion and 8471-13449 deletion read rates with age in 11 GTEx tissues29 .a The cumulative deletion read rate of the "Top 30" mtDNA deletions and (b) the 8471-13449 deletion.Tissues are from three paired datasets: 183 paired samples of cerebellum and cortex, 41 paired samples from multiple brain regions (i.e., amygdala, anterior cingulate cortex, caudate nucleus, frontal cortex, hippocampus, and substantia nigra), and 165 paired samples from non-brain regions (i.e., blood, liver, and skeletal muscle).P-values shown from linear regression models for Deletion ~Age, and include MT benchmark coverage and sex as co-variates.All pvalues were corrected for multiple tests using Bonferroni.

Fig. 7 |
Fig. 7 | Analysis of brain region and tissue in GTEx samples.a-c Comparisons between brain regions and tissues for the "Top 30" cumulative deletions, 6335-13999, 7816-14807, and 8471-13449 deletion in paired GTEx datasets 29 .Bar graphs represent mean ± SEM. d, e Comparisons across all 11 GTEx tissues.Sample size per tissue: (a) paired samples from cerebellum and cortex (n = 183 ea.);(b) paired samples from amygdala, anterior cingulate cortex, caudate nucleus, frontal cortex, hippocampus, and substantia nigra (n = 41 ea.); (c) paired samples from blood, liver, and skeletal muscle (n = 165 ea.).d the "Top 30" cumulative deletions for all GTEx tissues and matrix of p-values for individual comparisons.e The 8471-13449 deletion for all GTEx tissues and matrix of p-values for individual comparisons.Statistical values for paired tests (a-c) are from repeated measures ANOVA.Statistical values for individual tissue comparisons(d, e) are from Welch's t-tests.All p-values were corrected for multiple tests using Bonferroni.All boxplots show the median as a solid black line; the first and third quartiles are captured by the bounds of the box.Boxplot whiskers are defined as the first and third quartiles ± interquartile range times 1.5, respectively, and outliers are denoted as points.CER (cerebellum); CORT (cortex); FC (frontal cortex); HIPP (hippocampus); ACC (anterior cingulate cortex); AM (amygdala); CAUD (caudate nucleus); SN (substantia nigra); SM (skeletal muscle).

Table 2 |
Summary of GTEx RNA-Seq datasets evaluated for mtDNA deletions SE single-end, PE paired-end, M million.